Dataset v2.0 #461

aliberts · 2024-10-03T18:02:10Z

What this does

This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.

What do I need to do?

If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py to convert it to the new format.
You will be asked to enter a prompt describing the task performed in the dataset.

Example for a single-task dataset:

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket."

If you recorded your dataset with one of the manipulator robots currently supported in LeRobot (or your own implementation), you can provide its configuration path to add the motor names and robot type to the dataset info using the --robot-config option:

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml

For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.

Motivation

The current implementation of LeRobotDataset has a few shortcomings that make it hard to use in some respects. Specifically:

The structure of the files does not accurately reflect the data structure. Our datasets are structured by episodes, which contrasts with typical ML scenarios that use train/val/test splits (although these concepts can still be relevant here). This makes it hard to select a subset of episodes from a dataset, since the whole dataset has to be downloaded and loaded. Related: #440
Due to current limitations of the hub, a dataset with more than 10k episodes cannot be pushed (fewer if there are multiple cameras).
The format is not transparent to the user: to get information about the content of a dataset, the only options are to download the entire dataset and inspect it with a custom script, or to try visualizing it with our visualization tool. Related: #383
The default file cache system used by datasets and huggingface_hub makes it inconvenient to create datasets locally (with recording). To use the newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
Some of the file formats used (e.g. .safetensors) are too framework-specific for this format to be universal.
The dataset viewer on the hub is not compatible with our datasets due to VideoFrame not yet being integrated into datasets.
The current implementation lacks support for future features that we may want to add such as:
- Task-tokens-conditioned training
- Multirobot policies
- Depth images (Related: #435)

Changes

Some of the biggest changes come from the new file structure and its contents:

  .
  ├── data
- │   ├── train-00000-of-0001.parquet
+ │   ├── chunk-000
+ │   │   ├── episode_000000.parquet
+ │   │   ├── episode_000001.parquet
+ │   │   ├── episode_000002.parquet
+ │   │   └── ...
+ │   ├── chunk-001
+ │   │   ├── episode_001000.parquet
+ │   │   ├── episode_001001.parquet
+ │   │   ├── episode_001002.parquet
+ │   │   └── ...
+ │   └── ...
- ├── meta_data
+ ├── meta
- │   ├── episode_data_index.safetensors
+ │   ├── episodes.jsonl
  │   ├── info.json
+ │   ├── stats.json
- │   ├── stats.safetensors
+ │   └── tasks.jsonl
  └── videos
+     ├── chunk-000
+     │   ├── observation.images.laptop
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     │   ├── observation.images.phone
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     ├── chunk-001
      └── ...

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes to download and use. The structure of the dataset is entirely described in the info.json file, which can be downloaded or viewed directly on the hub before downloading any actual data. The file types used are simple and do not need complex tools to be read: only .parquet, .json, .jsonl and .mp4 files are used (plus .md for the README).
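As a rough illustration of the layout above, the sketch below maps an episode_index to its parquet and video paths. The chunk size of 1000 is an assumption inferred from the tree (chunk-001 starts at episode_001000), and the helper names are hypothetical. In the actual dataset the path templates are read from info.json rather than hard-coded.

```python
# Assumption: 1000 episodes per chunk, inferred from the tree above
# (chunk-001 starts at episode_001000). Real datasets describe this in info.json.
CHUNK_SIZE = 1000


def episode_chunk(episode_index: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Return the index of the chunk directory that holds this episode."""
    return episode_index // chunk_size


def data_path(episode_index: int, chunk_size: int = CHUNK_SIZE) -> str:
    """Hypothetical helper: relative path of the parquet file of an episode."""
    chunk = episode_chunk(episode_index, chunk_size)
    return f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"


def video_path(episode_index: int, video_key: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Hypothetical helper: relative path of one camera's video for an episode."""
    chunk = episode_chunk(episode_index, chunk_size)
    return f"videos/chunk-{chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"


print(data_path(1001))
print(video_path(2, "observation.images.laptop"))
```

Keeping the chunk index a pure function of episode_index means a loader can locate any episode's files without listing directories first.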

Added

A LeRobotDataset can now be instantiated with an episodes argument (e.g. episodes=[1, 10, 12, 40]) to select a specific subset of episodes by their episode_index. In that case, only the files corresponding to these episodes are downloaded (if they are not already on disk), and both the hf_dataset attribute and the episode_data_index only cover these episodes.
Dataset metadata logic is now handled by the LeRobotDatasetMetadata class. This makes it possible to get information about a dataset before loading the data. For example, you could do this:

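import math

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# Fetch metadata from the hub
metadata = LeRobotDatasetMetadata("lerobot/pusht")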

# Calculate train and val episodes
total_episodes = metadata.total_episodes
episodes = list(range(metadata.total_episodes))
num_train_episodes = math.floor(total_episodes * 90 / 100)
train_episodes = episodes[:num_train_episodes]
val_episodes = episodes[num_train_episodes:]

# Load train and val datasets
train_dataset = LeRobotDataset("lerobot/pusht", episodes=train_episodes)
val_dataset = LeRobotDataset("lerobot/pusht", episodes=val_episodes)

Tasks, expressed as natural language prompts, are now present in every dataset and are required to create one. Every task of a dataset is listed in tasks.jsonl, mapped to its task_index, which is what is actually stored in the parquet files. Using the API, tasks can be accessed either with dataset.meta.tasks to get that mapping, or through dataset.episode_dict[episode_index]["tasks"] if you are only interested in a particular episode.
Various information about the structure of the dataset has been added and is now centralized in info.json (keys, shapes, number of episodes, etc.). It serves as the source of truth for what is inside the dataset.
episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode length). This is accessed through the episode_dict attribute in the API.
LeRobotDataset.create() makes it possible to create a new dataset from scratch, either for recording data or for porting an existing dataset to the LeRobotDataset format. To that end, new methods are added:
- start_image_writter(): This instantiates an ImageWriter in the image_writer attribute to write images asynchronously during data recording. This is automatically called during LeRobotDataset.create() if specified in the arguments.
- stop_image_writter(): This properly stops the ImageWriter and removes it from the dataset's attributes. Importantly: if the image_writer has been set to a multiprocess ImageWriter, this needs to be called first if you want to pass this dataset into a parallelized DataLoader, as the ImageWriter class is not pickleable (which is required for objects to be transferred between processes). This is not needed when instantiating a dataset with __init__, as the image_writer is not created in that case.
- add_frame(): Adds a single frame (the data for one timestamp) to the episode_buffer, which keeps data in memory temporarily. Note: this will be merged with the DataBuffer from #445 in a subsequent PR.
- add_episode(): Saves the content of the episode_buffer to disk and updates the metadata so that it stays in sync with the contents of the files. This method expects a task argument as a string prompt in natural language describing the task performed in the episode. Videos from that episode can optionally be encoded during this phase, but this is not mandatory and can be done later to give more flexibility on when to do it.
- consolidate(): This will encode videos that have not yet been encoded, clean up the temporary image files, compute dataset statistics, check that timestamps are in sync with the fps and perform additional sanity checks on the dataset. It needs to be called before uploading the dataset to the hub with push_to_hub().
- clear_episode_buffer(): This can be used to reset the episode_buffer (e.g. to discard data from a current recording).
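To make the buffering semantics described above more concrete, here is a toy stand-in that mimics only the in-memory behaviour of add_frame(), add_episode() and clear_episode_buffer(). The ToyEpisodeRecorder class and the keys of the episode records are hypothetical. It writes nothing to disk and is not the LeRobotDataset implementation.

```python
class ToyEpisodeRecorder:
    """Toy illustration of an episode buffer (hypothetical, not LeRobotDataset)."""

    def __init__(self) -> None:
        self.episode_buffer: list[dict] = []  # frames of the episode being recorded
        self.episodes: list[dict] = []  # stands in for the saved per-episode metadata

    def add_frame(self, frame: dict) -> None:
        # Frames only accumulate in memory until the episode is saved or discarded.
        self.episode_buffer.append(frame)

    def add_episode(self, task: str) -> None:
        if not self.episode_buffer:
            raise ValueError("episode_buffer is empty, nothing to save")
        # A real implementation would write files here. This toy only records metadata.
        self.episodes.append(
            {
                "episode_index": len(self.episodes),
                "task": task,
                "length": len(self.episode_buffer),
            }
        )
        self.clear_episode_buffer()

    def clear_episode_buffer(self) -> None:
        self.episode_buffer = []


recorder = ToyEpisodeRecorder()

# A failed attempt: record three frames, then discard them.
for step in range(3):
    recorder.add_frame({"timestamp": step / 30})
recorder.clear_episode_buffer()

# A successful attempt: record two frames, then save the episode.
for step in range(2):
    recorder.add_frame({"timestamp": step / 30})
recorder.add_episode(task="Do something")

print(recorder.episodes)
```

Because frames stay in the buffer until add_episode() is called, a bad take can be dropped without touching anything that was already saved.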

Changed

The logic for checking that timestamps and delta_timestamps are in sync has been moved out of __getitem__() and is now run during __init__ or consolidate. This both saves computation in __getitem__() and reveals sync issues with the timestamps immediately.
The paths for the parquet and video files are now embedded in info.json to allow flexibility and to easily split chunks of files between directories, which avoids the hub's limit on the number of files per folder (10k).
We now store every dataset (created or downloaded) in ~/.cache/huggingface/lerobot by default. That location can be changed by passing root or by setting the LEROBOT_HOME env variable. Every call to the huggingface_hub download functions such as snapshot_download or hf_hub_download sets the local_dir argument to that location, so that files are not duplicated in the cache and files already present on disk are not downloaded again.
Refactored the image writing code from populate_dataset.py into an ImageWriter class.
stats.safetensors is now stats.json (the content remains the same but it's unflattened).
episode_data_index.safetensors is removed but the episode_data_index is still in the api to map episode_index to indices.
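Since the episode_data_index is no longer stored as a file, it has to be derived from per-episode information. The sketch below shows one way such an index could be rebuilt from episode lengths. The function name and the "from"/"to" keys are assumptions for illustration and may not match what the library does internally.

```python
from itertools import accumulate


def build_episode_data_index(episode_lengths: list[int]) -> dict[str, list[int]]:
    """Hypothetical helper: map each episode to its [from, to) range of frame indices."""
    ends = list(accumulate(episode_lengths))  # exclusive end index of each episode
    starts = [end - length for end, length in zip(ends, episode_lengths)]
    return {"from": starts, "to": ends}


index = build_episode_data_index([3, 2, 4])
print(index)

# Frame indices that belong to episode 1
print(list(range(index["from"][1], index["to"][1])))
```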

Performance

In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is on par with the previous version: sometimes slightly faster, but generally in the same ballpark.

__getitem__() call time in seconds (average on 10k iterations):

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0036 | 0.0037
lerobot/aloha_sim_insertion_human       | 0.0029 | 0.0027
lerobot/pusht_image                     | 0.0003 | 0.0003
lerobot/pusht                           | 0.0011 | 0.0009
aliberts/koch_tutorial                  | 0.0111 | 0.0106
lerobot/aloha_mobile_cabinet            | 0.0104 | 0.0101

Benchmarking code

from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average on {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")

With delta_timestamps, results vary more across datasets but remain in the same ballpark: v2.0 is faster on four of the six datasets and slower on lerobot/pusht_image and lerobot/pusht.
__getitem__() call time in seconds (average on 10k iterations), delta_timestamps=[-1/fps, 0, 1/fps]:

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0176 | 0.0160
lerobot/aloha_sim_insertion_human       | 0.0073 | 0.0068
lerobot/pusht_image                     | 0.0024 | 0.0032
lerobot/pusht                           | 0.0028 | 0.0043
aliberts/koch_tutorial                  | 0.0200 | 0.0184
lerobot/aloha_mobile_cabinet            | 0.0224 | 0.0181

Benchmarking code (delta_timestamps)

from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average on {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    fps = dataset.fps
    keys = ["observation.state", *dataset.camera_keys]
    delta_timestamps = {key: [-1/fps, 0, 1/fps] for key in keys}
    dataset = LeRobotDataset(repo_id=repo_id, delta_timestamps=delta_timestamps)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    del dataset
    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")

Fixes

Fixed a bug in load_previous_and_future_frames, which did not actually raise an error when the timestamps requested through delta_timestamps did not correspond to actual timestamps in the dataset.
Various fixes on the datasets have been made:
- Some tasks already present in some datasets contained strings which were not part of the task (e.g. "tf.Tensor(b'Do something', shape=(), dtype=string)")
- Some video files were not properly tracked by git lfs
- Some datasets present a mismatch between the number of episodes in their parquet and the number of video files. This is being investigated [TODO]
  - lerobot/aloha_mobile_shrimp
  - lerobot/aloha_static_battery
  - lerobot/aloha_static_fork_pick_up
  - lerobot/aloha_static_thread_velcro
  - lerobot/uiuc_d3field
- lerobot/viola is missing video keys [TODO]
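For the task strings polluted by a tensor representation mentioned in the list above, a cleanup along the following lines would recover the plain prompt. This is only a sketch of the idea. The clean_task helper and its regular expression are hypothetical and are not taken from the conversion script.

```python
import re

# Matches strings such as "tf.Tensor(b'Do something', shape=(), dtype=string)".
_TF_TENSOR_PATTERN = re.compile(
    r"^tf\.Tensor\(b(['\"])(?P<task>.*)\1, shape=\(\), dtype=string\)$"
)


def clean_task(task: str) -> str:
    """Hypothetical helper: strip a tf.Tensor wrapper from a task string, if present."""
    match = _TF_TENSOR_PATTERN.match(task.strip())
    # Strings that do not look like a tensor representation are returned unchanged.
    return match.group("task") if match else task


print(clean_task("tf.Tensor(b'Do something', shape=(), dtype=string)"))
print(clean_task("Pick the Lego block and drop it in the box on the right."))
```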

How it was tested

Adds tests/fixtures/, which contains fixtures and fixture factories that simplify writing and adding tests. These factories give the flexibility to create partially mocked objects on the fly for use in tests, without relying on other components of the codebase that are not meant to be tested in a particular test (e.g. initializing a dataset using hydra).
Adds tests/test_image_writer.py
Adds tests/test_delta_timestamps.py
Deactivates a bunch of tests which will need to be redesigned and simplified in further PRs.

How to checkout & try? (for the reviewer)

Use an existing dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "lerobot/aloha_sim_insertion_human"  # try with '_image' as well

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)
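The delta_timestamps above are expressed in seconds and only make sense if each value lands on an actual frame. The sketch below converts them to frame offsets and rejects values that do not. The 50 fps figure is an assumption inferred from the 1/50 step in the example, and the helper name and tolerance are hypothetical rather than the library's own check.

```python
def delta_timestamps_to_frame_offsets(
    deltas: list[float], fps: int, tolerance_s: float = 1e-4
) -> list[int]:
    """Hypothetical helper: convert time deltas (seconds) to whole frame offsets."""
    offsets = []
    for delta in deltas:
        offset = round(delta * fps)  # nearest frame
        # Reject deltas that are not (almost) a multiple of the frame period 1/fps.
        if abs(offset / fps - delta) > tolerance_s:
            raise ValueError(f"{delta} s is not a multiple of 1/{fps} s")
        offsets.append(offset)
    return offsets


fps = 50  # assumption, inferred from the 1/50 step used above
offsets = delta_timestamps_to_frame_offsets([-1, -1 / 50, 0, 25 / 50], fps)
print(offsets)
```

With these values, each key would load the frame one second earlier, the previous frame, the current frame, and the frame half a second later.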

Try out the new feature to select / download specific episodes:

dataset = LeRobotDataset(repo_id=REPO_ID, episodes=[1, 10, 12, 40])

You can also create a new dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "your_hf_username/test_v2"

new_dataset = LeRobotDataset.create(
    repo_id=REPO_ID,
    fps=30,
    robot=robot,
    image_writer_threads_per_camera=1,
)

# TODO
frame = {
    ...
}
new_dataset.add_frame(frame)
new_dataset.add_episode(task="Do something")
new_dataset.consolidate()

…_25_reshape_dataset

lerobot/common/datasets/lerobot_dataset.py

Cadene · 2024-11-22T13:12:25Z

TODO after merging: #485

Cadene

Beautiful work thanks. Left some comments. Hope it helps :)

.github/workflows/test.yml

examples/1_load_lerobot_dataset.py

examples/advanced/2_calculate_validation_loss.py

examples/port_datasets/pusht_zarr.py

tests/test_datasets.py

Cadene · 2024-11-22T16:18:58Z

tests/test_datasets.py

@@ -297,6 +289,7 @@ def test_flatten_unflatten_dict():
    assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"


+@pytest.mark.skip("TODO after v2 migration / removing hydra")


This test, test_backward_compatibility(repo_id), makes me think we should probably train diffusion policy on pusht before and after this PR to compare dataset v1 vs v2.

tests/test_datasets.py

tests/test_policies.py

astroyat · 2024-11-23T23:38:04Z

I tried training using the new dataset and see some errors in compute_stats.py, should d.stats be changed to d.meta.stats?

…_25_reshape_dataset

lerobot/common/datasets/lerobot_dataset.py

Co-authored-by: Remi

WIP

ad115b6

aliberts added ✨ Enhancement New feature or request 🗃️ Dataset Something dataset-related labels Oct 3, 2024

aliberts self-assigned this Oct 3, 2024

aliberts linked an issue Oct 3, 2024 that may be closed by this pull request

[Feature Request] Add Detailed Information about Observation Fields to Metadata File in leRobotDataset Repository #383

Closed

aliberts added 11 commits October 4, 2024 11:22

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09…

17a1214

…_25_reshape_dataset

Add upload folders

1016a98

Add info.json link

07e113c

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09…

028c17f

…_25_reshape_dataset

Add pixel channels

21ba4b5

Update info.json format

2d75b93

Rework LeRobotDataset.__init__

096824b

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09…

3113038

…_25_reshape_dataset

Update LeRobotDataset.__get_item__

b417ceb

Add doc, scrap video_frame_keys attribute

6d2bc11

Add huggingface-hub patch for offline snapshot_download with local_dir

7f68088

Cadene self-requested a review October 11, 2024 15:10

Add padding keys and download_data option

3ea5312

Cadene reviewed Oct 11, 2024

lerobot/common/datasets/lerobot_dataset.py

aliberts added 11 commits October 11, 2024 18:52

Add suggestions from code review

8bd406e

Add multitask support, refactor conversion script

cf63334

Extend v1 compatibility

cbc51e1

Fix safe_version

f96773d

Cleanup, fix load_tasks

835ab5a

Update load_tasks doc

da78bbf

WIP add batch convert

9433ac5

Add fixes for batch convert

1102640

Add episode chunks logic, move_videos & lfs tracking fix

c146ba9

Write episodes as jsonlines

50a75ad

Add fixes for lfs tracking

ad3f112

aliberts and others added 10 commits November 18, 2024 18:54

Add comment on license

acae4b4

Improve dataset v2 (#498)

1f13bda

Use HWC for images

6203641

Update example 1

9ee8711

Fix tests

f43e5d0

Enhance dataset cards

c6ad495

Fix conversion script

37da50b

Add open X datasets

93d9bf8

Update example 1

36b9b60

Remove todos

f56d769

Cadene approved these changes Nov 22, 2024


aliberts added 7 commits November 25, 2024 12:44

Apply suggestions from code review

23f6c87

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09…

3b5af7e

…_25_reshape_dataset

Refactor pusht_zarr

6ad84a6

Remove comment

49bdcc0

Activate end-to-end tests

56c01a2

Merge remote-tracking branch 'origin/main' into user/aliberts/2024_09…

691d39a

…_25_reshape_dataset

Comment

2945dca

dirkmcpherson reviewed Nov 27, 2024

lerobot/common/datasets/lerobot_dataset.py

aliberts added 5 commits November 28, 2024 10:39

Remove default root data dir, add fixes

2556960

Remove DATA_DIR references

d6b4429

Remove remaining DATA_DIR reference

82ff776

Remove commented code

ea5009e

Remove unused arg

0cb0af0

aliberts merged commit 32eb0ce into main Nov 29, 2024
6 of 7 checks passed

aliberts deleted the user/aliberts/2024_09_25_reshape_dataset branch November 29, 2024 18:04

DomThePorcupine pushed a commit to DomThePorcupine/lerobot that referenced this pull request Dec 2, 2024

Dataset v2.0 (huggingface#461)

9a9a0f5

Co-authored-by: Remi

alik-git mentioned this pull request Dec 4, 2024

How to make a custom LeRobotDataset with v2? #547

Open
